Reinforcement and imitation learning for a task

ABSTRACT

A neural network control system for controlling an agent to perform a task in a real-world environment operates based on both image data and proprioceptive data describing the configuration of the agent. The training of the control system includes both imitation learning, using datasets generated from previous performances of the task, and reinforcement learning, based on rewards calculated from control data output by the control system.

CROSS REFERENCE TO RELATED APPLICATION

This application is a non-provisional of and claims priority to U.S. Provisional Patent Application No. 62/578,368, filed on Oct. 27, 2017, the entire contents of which are hereby incorporated by reference.

BACKGROUND

This specification relates to methods and systems for training a neural network to control an agent to carry out a task in an environment.

In a reinforcement learning (RL) system, an agent interacts with an environment by performing actions that are selected by the reinforcement learning system in response to receiving observations that characterize the current state of the environment.

Some reinforcement learning systems select the action to be performed by the agent in response to receiving a given observation in accordance with an output of a neural network.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Some neural networks are recurrent neural networks. A recurrent neural network is a neural network that receives an input sequence and generates an output sequence from the input sequence. In particular, a recurrent neural network can use some or all of the internal state of the network from a previous time step in computing an output at a current time step.

An example of a recurrent neural network is a long short-term memory (LSTM) neural network that includes one or more LSTM memory blocks. Each LSTM memory block can include one or more cells that each include an input gate, a forget gate, and an output gate that allow the cell to store previous states for the cell, e.g., for use in generating a current activation or to be provided to other components of the LSTM neural network.

In an imitation learning (IL) system, a neural network is trained to control an agent to perform a task using data characterizing instances in which the task has previously been performed by the agent under the control of an expert, such as a human user.

SUMMARY

This specification generally describes how a system implemented as computer programs in one or more computers in one or more locations can perform a method to train (that is, adjust the parameters of) an adaptive system (“neural network”) used to select actions to be performed by an agent interacting with an environment.

The agent is a mechanical system (e.g., a “robot”, but it may alternatively be a vehicle, such as one for carrying passengers) comprising one or more members connected together, for example using joints which permit relative motion of the members, and one or more drive mechanisms which control the relative position of the members. For example, the neural network may transmit commands (instructions) to the agent in the form of commands which indicate “joint velocities”, that is, the angular rate at which the drive mechanism(s) should move one or more of the members relative to other of the members. The agent is located within a real-world (“real”) environment. The agent may further comprise at least one drive mechanism to translate and/or rotate the agent in the environment under the control of the neural network. Note that in some implementations, the agent may contain two or more disjoint portions (portions which are not connected to each other), which act independently based on respective commands they receive from the neural network.

As described below, the training method may make use of a simulated agent. The simulated agent has a simulated motion within a simulated environment which mimics the motion of the robot in the real environment. Thus the term “agent” is used to describe both the real agent (robot) and the simulated agent.

The agent (both the real agent and the simulated agent) is controlled by the neural network to perform a task in which the agent manipulates one or more objects which are part of the environment and which are separate from (i.e., not part of) the agent. The task is typically defined based on one or more desired final positions of the object(s) following the manipulation.

In implementations described herein, the neural network makes use of data which comprises images of the real or simulated environment (typically including an image of at least a part of the agent) and proprioceptive data describing one or more proprioceptive features which characterize the configuration of the (real or simulated) agent. For example, the proprioceptive features may be positions and/or velocities of the members of the agent, for example joint angles, and/or joint angular velocities. Additionally or alternatively they may include joint forces and/or torques and/or accelerations, for example gravity-compensated torque feedback, and global or relative pose of an item held by the agent.

In general terms, the implementation proposes that the neural network receives both image and proprioceptive data, and is trained using an algorithm having advantages of both imitation learning and reinforcement learning.

Specifically, the training employs, for each of a plurality of previous performances of the task, a respective dataset characterizing the corresponding performance of the task. Each dataset describes a trajectory comprising a set of observations characterizing a set of successive states of the environment and the agent, and the corresponding actions performed by the agent. Each previous performance of the task may be a performance of the task by the agent under the control of an expert, such as a human. This is termed an “expert trajectory”. Each dataset may comprise data specifying explicitly the positions of the objects at times during the performance of the task (e.g., in relation to coordinate axes spanning the environment) and corresponding data specifying the position and configuration of the agent at those times, which may be in the form of proprioceptive data. In principle, alternatively or additionally, each dataset may comprise image data encoding captured images of the environment during the performance of the task.

Furthermore, as for conventional reinforcement learning, an initial state of the environment and agent is defined, and then the neural network generates a set of commands (i.e. at least one command, or more preferably a plurality of successive commands). For each set of commands, the effect on the environment of supplying the commands to the agent (successively in the case that the set includes a plurality of successive commands) is determined, and used to generate at least one reward value indicative of how successfully the task is carried out upon implementation of the set of commands by the agent. Based on the reward values, the neural network is updated. This process is repeated for different initial states.

Each set of commands results in a respective further dataset which may be called an imitation trajectory which comprises both the actions performed by the agent in implementing the commands in sequence, and a set of observations of the environment while the set of commands is performed in sequence (i.e., a sequence of states which the environment takes as the set of commands are successively performed). The reward value may be obtained using a final state of the environment following the performance of the set of commands. Alternatively, more generally, it may be obtained from any one or more of the states of the environment as the commands are implemented.

The neural network is trained (i.e., one or more parameters of the neural network are adjusted) based on the datasets, the sets of commands and the corresponding reward values.

Thus, the training of the neural network benefits both from whatever expert performances of the task are available, and from the results of the experimentation associated with reinforcement learning. The resultant system may learn the task more successfully than a neural network trained using reinforcement learning or imitation learning alone.

The training of the neural network may be performed as part of a process which includes, for a certain task:

-   performing the task (e.g., under control of a human learner) a plurality of times and collecting the respective datasets characterizing the performances;
-   initializing a neural network;
-   training the neural network by the technique described above; and
-   using the neural network to control a (real-world) agent to perform the task in an environment.

Compared to a system which just uses imitation learning, the trained neural network of the present disclosure will typically learn to control the agent to perform the task more accurately because it is trained from a larger database of examples. From another point of view, the number of expert trajectories needed to achieve a certain level of performance in the task is reduced compared to using imitation learning alone, which is particularly significant if the expert trajectories are expensive or time-consuming to obtain. Furthermore, the neural network may eventually learn to perform the task using a strategy which is different from, and superior to, the strategy used by the human expert.

Compared to a system which just learns by reinforcement learning, the trained neural network may make more natural motions (e.g., less jerky ones) because it benefits from examples of performing the task well.

Furthermore, compared to a system which just learns by reinforcement learning, learning time may be improved, and the amount of computational resources consumed reduced. This is because the first command sets which are generated during the training procedure are more likely to be successful in performing the task, and thus are more informative about how the task can be performed. In other words, the datasets which characterize the demonstration trajectories initiate high-value learning experiences, so there is synergy between the learning based on the datasets and the learning based on the generation of sets of commands.

The one or more parameters of the neural network may be adjusted based on a hybrid reward function. The hybrid reward function includes both an imitation reward value (r_(gail)) derived using the datasets, and a task reward value (r_(task)) calculated using a task reward function. The task reward function defines task reward values based on states of the environment (and optionally also the agent) which result from the sets of commands.

The imitation reward value may be obtained from a discriminator network. The discriminator network may be trained using the neural network as a policy network. Specifically, the discriminator network may be trained to give a value indicative of a discrepancy between the commands output by the neural network (policy network) and an expert policy inferred from the datasets.

Preferably, the discriminator receives input data characterizing the positions of the objects in the environment, but preferably not image data and/or not proprioceptive data. Furthermore, it preferably does not receive other information indicating the position/configuration of the agent.

Updates of the neural network are generated using an advantage function estimator obtained by subtracting a value function (baseline value) from an initial reward value, such as the hybrid reward, indicating whether the task (or at least a stage in the task) is successfully performed by an agent which receives the set of commands. The value function may be calculated, e.g., using a trained adaptive system, using data characterizing the positions and/or velocities of the agent and/or the object(s), which may be image data and proprioceptive data, but more preferably is data directly indicating the positions of the object(s) and/or agent.

In principle, during the training, the sets of commands generated by the neural network could be implemented in the real world to determine how successful they are in controlling a real agent to perform the task. However, more preferably, the reward value is generated by computationally simulating a process carried out by the agent in the environment, starting from an initial state, based on the commands, to generate successive simulated states of the environment and/or agent, up to a simulated final state of the environment and/or agent, and calculating the reward value based at least on the final state of the simulated environment. The simulated state of the environment and/or agent may be explicitly specified by “physical state data”.

The neural network itself may take many forms. In one form, the neural network may comprise a convolutional neural network which receives the image data and from it generates convolved data. The neural network may further comprise at least one adaptive component (e.g., a multilayer perceptron) which receives the output of the convolutional neural network and the proprioceptive data, and generates the commands. The neural network may comprise at least one long short-term memory (LSTM) neural network, which receives data generated both from the image data and the proprioceptive data.

Training of one or more components of the neural network may be improved by defining an auxiliary task (i.e., a task other than the task the overall network is trained to control the agent to perform) and training those component(s) of the neural network using the auxiliary task. The training using the auxiliary task may be performed before training the other adaptive portions of the neural network by the techniques described above. For example, prior to the training of the neural network as described above, there may be a step of training a convolutional network component of the neural network as part of an adaptive system which is trained to perform an auxiliary task based on image data. Alternatively, the training of the convolutional network to learn the auxiliary task may be simultaneous with (e.g., interleaved with) the training of the neural network to perform the main task.

Preferably many instances of the neural network are trained in parallel by respective independent computational processes referred to as “workers”, which may be coordinated by a computational process termed a “controller”. Certain components of the instances of the neural network (e.g., a convolutional network component) may be in common between instances of the neural network. Alternatively or additionally, sets of commands and reward values generated in the training of the multiple neural network instances may be pooled, to permit a richer exploration of the space of possible policy models. The initial state may be (at least with a certain probability greater than zero) a predetermined state which is a state comprised in one of the expert trajectories. In other words, the expert trajectories provide a “curriculum” of initial states. Compared to completely random initial states, states from the expert trajectories are more likely to be representative of states which the trained neural network will encounter when it acts in a similar way to the human expert. Thus, training using the curriculum of initial states produces more relevant information than if the initial state is chosen randomly, and this can lead to a greater speed of learning.

The initial states of the environment and agent may be selected probabilistically. Optionally, with a non-zero probability it is a state from one of the expert trajectories, and with a non-zero probability it is a state generated otherwise, e.g., from a predefined distribution, e.g., one which is not dependent on the expert trajectories.

It has been found that for certain tasks it is advantageous to divide the task into a plurality of portions (i.e., task-stages) using insight (e.g., human insight) into the task. The task stages may be used for designing the task reward function. Alternatively or additionally, prior to the training of the neural network, for each task-stage one or more predefined initial states are derived (e.g., from the corresponding stages of one or more of the expert trajectories). During the training, one of the predefined initial states is selected as the initial state (at least with a certain probability greater than zero). In other words, the curriculum of initial states includes one or more states for each of the task stages. This may further reduce the training time, reduce the computational resources required and/or result in a trained neural network which is able to perform the task more accurately.

BRIEF DESCRIPTION OF THE DRAWINGS

Examples of the disclosed systems and methods will now be described for the sake of example only with reference to the following figures, in which:

FIG. 1 shows a trained neural network produced by a method according to the present disclosure;

FIG. 2 shows the system used to train the network of FIG. 1; and

FIG. 3 shows steps of a method according to the disclosure carried out by the system of FIG. 2.

DETAILED DESCRIPTION

FIG. 1 shows a system 100 according to an example of the present disclosure. It includes at least one image capture device (e.g., a still camera or video camera) 101, and at least one agent 102. The agent is a “real world” device which operates on one or more real-world objects in a real-world environment including at least one object 103. The device may for example be a robot, but may alternatively be a vehicle, such as an autonomous or semi-autonomous land or air or sea vehicle. For simplicity only a single image capture device 101, only a single object 103 and only a single agent 102 are illustrated, but in other examples multiple image capture devices, and/or objects, and/or agents, are possible.

The images captured by the image capture device 101 include the object 103, and typically also the agent 102. The image capture device 101 may have a static location, at which it captures images of some or all of the environment. Alternatively, the image capture device 101 may be capable of being controlled to vary its field of view, under the influence of a control mechanism (not shown). In some implementations, the image capture device 101 may be positioned on the agent 102, so that it moves with it. The image capture device 101 uses the images to generate image data 104. The image data may take the form of RGB (red-green-blue) data for each of a plurality of pixels in an array.

Control signals 105 for controlling the agent 102 are generated by a control system 106, which is implemented as a neural network. The control system 106 receives as one input the image data 104 generated by the image capture device 101.

A further input to the control system 106 is data 107 characterizing the present positioning and present configuration of the agent 102. This data is referred to as “proprioceptive” data 107.

In a first example, the agent 102 may include one or more joints which permit one member (component) of the agent 102 to be rotated with one, two or three degrees of freedom about another member of the agent 102 (e.g., a first arm of the agent may pivot about a second arm of the agent; or a “hand” of the agent may rotate about one or more axes with respect to an arm of the agent which supports it), and if so the proprioceptive data 107 may indicate a present angular configuration of each of these joints.

In a second example, the proprioceptive data 107 may indicate an amount by which a first member of the robot is translated relative to another member of the robot. For example, the first member may be translatable along a track defined by the other member of the robot.

In a third example, the agent 102 may comprise one or more drive mechanisms to displace the agent as a whole (by translation and/or rotation) relative to its environment. For example, this displacement may be an amount by which a base member of the agent 102 is translated/rotated within the environment. The proprioceptive data 107 may indicate the amount by which the agent has been displaced.

These three examples of proprioceptive data are combinable with each other in any combination. Additionally or alternatively, the proprioceptive data 107 may comprise data indicating at least one time-derivative of any of these three examples of proprioceptive data, e.g., a first derivative with respect to time. That is, the proprioceptive data 107 may include any one or more of the angular velocity, in one, two or three dimensions, at which one member of the robot rotates about another member of the robot at a joint; and/or a translational velocity of one member of the robot with respect to another; and/or the angular or translational velocity of the robot as a whole within its environment.

The control system 106 includes a plurality of neural network layers. In the example of FIG. 1 it includes a convolutional network 108 which receives the image data 104. The convolutional network 108 may comprise one or more convolutional layers. The control system 106 further includes a neural network 109 (which may be implemented as a single or multi-layer perceptron), which receives both the proprioceptive data 107 and the output of the convolutional network 108. For example, the proprioceptive data 107 may be concatenated with the output of the convolutional network 108 to form an input string for the perceptron 109.

The control system 106 further includes a recurrent neural network 110, such as an LSTM unit, which receives the output of the perceptron 109, and generates an output.

The control system 106 further includes an output layer 111 which receives the output of the recurrent neural network 110 and from it generates the control data 105. This may take the form of a desired velocity for each of the degrees of freedom of the agent 102. For example, if the degrees of freedom of the agent 102 comprise the angular positions of one or more joints, the control data 105 may comprise data specifying a desired angular velocity for each degree of freedom of the joint(s).

Optionally, the output layer 111 may generate the control data as a sample from a distribution defined by the inputs to the output layer 111. Thus, the control system 106 as a whole defines a stochastic policy π_(θ) which outputs a sample from a probability distribution over the action space, i.e., specifying a respective probability value for any possible action a in the action space. The probability distribution depends upon the input data to the control system 106. The neural network which implements the control system 106 may be referred to as a “policy network”.
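By way of illustration, the following is a minimal sketch of a policy network of this kind, written in PyTorch. The layer sizes, the assumed 84×84 image resolution, and the use of a diagonal Gaussian over joint-velocity commands are illustrative assumptions rather than details taken from the present disclosure.

```python
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    """Sketch of the control system of FIG. 1: convolutional network 108 ->
    perceptron 109 -> LSTM 110 -> output layer 111. Sizes are assumptions."""

    def __init__(self, proprio_dim=14, action_dim=7, hidden_dim=128):
        super().__init__()
        # Convolutional network 108: encodes the RGB image data 104.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = 32 * 9 * 9          # for an assumed 84x84 input image
        # Perceptron 109: combines conv features with proprioceptive data 107.
        self.mlp = nn.Sequential(
            nn.Linear(conv_out + proprio_dim, hidden_dim), nn.ReLU(),
        )
        # Recurrent network 110.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Output layer 111: parameterizes a diagonal Gaussian over joint velocities.
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, image, proprio, hidden=None):
        # image: (B, 3, 84, 84); proprio: (B, proprio_dim)
        feats = self.conv(image)
        x = self.mlp(torch.cat([feats, proprio], dim=-1))
        x, hidden = self.lstm(x.unsqueeze(1), hidden)   # single time step
        dist = torch.distributions.Normal(self.mean(x.squeeze(1)),
                                          self.log_std.exp())
        action = dist.sample()          # stochastic policy: sample a command
        return action, dist, hidden
```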

FIG. 2 illustrates a system 200 within which the control system 106 is trained. In FIG. 2 elements having the same significance as elements of FIG. 1 are given reference numerals 100 higher. Thus, the control system 106 is designated 206. The convolutional network 108 is designated 208, the perceptron 109 is designated 209, and the layers 110 and 111 are denoted 210, 211. The method of operation of the system 200 will be described in more detail below with reference to FIG. 3.

The system 200 includes a database 220 which stores “expert trajectories”. These are explained in detail below, but generally speaking are records of instances of an agent performing a computational task under the control of a human expert.

The system 200 further includes an initial state definition unit 221 which defines initial states of the environment and the agent. As described below, the initial state definition unit 221 may do this by selecting a state from the database 220, or by generating a new state at random. The initial state is defined by “physical state data”, denoted s_(t), which specifies the positions and optionally velocities of all objects in the environment when they are in the initial state (the integer index t may for example signify the amount of time for which the training algorithm for the control system 206 has been running when the initial state is defined), and typically also the positions/velocities of all members (components) of the agent. Based on the physical state data for an initial state generated by the initial state definition unit 221, the initial state definition unit 221 generates corresponding “control system state data” comprising image data 204 and proprioceptive data 207, and transmits them respectively to the convolutional network 208 and the perceptron 209. The proprioceptive data 207 of the control system state data may be data comprised in the physical state data. The image data 204 may be generated computationally from the physical state data, as if it were captured by imaging the initial state of the environment using a “virtual camera”. The position and/or orientation of the virtual camera may be selected at random, e.g., for each respective initial state.

The output of the convolutional network 208 is transmitted both to the perceptron 209, and also to an optional auxiliary task neural network 224. The auxiliary task neural network 224 comprises an input neural network 225 (such as a multilayer perceptron) and an output layer 226. The output of the input neural network 225 is transmitted to the output layer 226 which formats it to generate an output P. As described below, the auxiliary task neural network 224 can be trained to perform an auxiliary task, for example such that the output layer 226 outputs data P characterizing the positions of objects in the environment.

The output of the control system 206 is control data 205 suitable for controlling an agent. No agent is included in the system 200. Instead, the control data 205 is transmitted to a physical simulation unit 223 which has additionally received the physical state data s_(t) corresponding to the initial system state from the initial state definition unit 221. Based on the physical state data s_(t) and the control data 205, the physical simulation unit 223 is configured to generate simulated physical state data s_(t+1) which indicates the configuration of the environment and of the agent at an immediately following time-step which would result, starting from the initial system state, if the agent 102 performed the action defined by the control data 205.

The control data 205, and the initial physical state data s_(t) generated by the initial state definition unit 221, are transmitted to a discriminator network 222, which may be implemented as a multi-layer perceptron. (In later steps, described below, counted by the integer index i=1, . . . K, where K is an integer preferably greater than one, the initial physical state data s_(t) is replaced by physical state data denoted s_(t+i) generated by the physical simulation unit 223.) The discriminator network 222 is explained in more detail below, but in general terms it generates an output value D_(ψ)(s, a) which is indicative of how similar the action a specified by the control data 205 is to the action which a human expert would have instructed the agent to make if the environment and the agent were in the initial state.

The physical simulation unit 223 may be configured to use s_(t) and a to generate updated physical state data s_(t+1), directly indicating the positions/configurations of the objects and the agent in the environment following the performance of action a by the agent. From the updated physical state data s_(t+1), the physical simulation unit 223 generates updated control system state data, i.e., new image data and new proprioceptive data, and feeds these back respectively to the convolutional network 208 and the perceptron 209, so that the control system 206 can generate new control data 205. This loop continues until updated physical state data has been generated K times. The final batch of physical state data is physical state data s_(t+K) for the final state of the environment and agent. Thus, the system 200 is able to generate a sequence of actions (defined by respective sets of control data) for the agent to take, starting from the initial state defined by the initial state definition unit 221.

As described below with reference to FIG. 3, during the training of the control system 206, the physical state data s_(t) for the initial state, and subsequently updated physical state data s_(t+K) for the final state of the system after K steps, are transmitted to a value neural network 227 for calculating values V_(ϕ)(s_(t)) and V_(ϕ)(s_(t+K)). The value neural network 227 may comprise a multilayer perceptron 228 arranged to receive the physical state data, and a recurrent neural network 229 such as an LSTM unit arranged to receive the output of the perceptron 228. It further includes an output layer 230 to calculate the value function V_(ϕ)(s) based on the inputs to the output layer 230.

The physical state data output by the initial state definition unit 221 and the physical simulation unit 223, the values V_(ϕ)(s_(t)) and V_(ϕ)(s_(t+K)) generated by the value network 227, the outputs P of the auxiliary task network 224, and the outputs D_(ψ)(s, a) of the discriminator network 222 are all fed to an update unit 231. The update unit 231 is configured, based on its inputs, to update the control system 206, e.g., the convolutional network 208, and the layers 209, 210. The update unit 231 is also configured to update the auxiliary task network 224 (e.g., the unit 225), the value network 227, and the discriminator network 222.

The system 200 also comprises a controller 232 which is in charge of initiating and terminating the training procedure.

We now turn to a description of a method 300, illustrated in FIG. 3, for training the control system 206 to perform a task within the system 200, and for subsequently using the system 100 of FIG. 1 to perform that task in the real-world environment. The training employs both datasets describing the performance of the task by an expert, and additional datasets characterizing additional simulated performances of the task by the agent 102 under the control of the control system 206 as it learns. The training combines advantages of imitation learning and reinforcement learning.

In a first step 301 of method 300, a plurality of datasets are obtained characterizing respective instances of performance of the task by a human agent (“expert trajectories”). These datasets are typically generated by instances in which a human controls the agent 102 to perform the task (rather than a human carrying out the task manually without using the agent 102). Each dataset describing an expert trajectory contains physical state data describing the positions and optionally velocities of the object at each of a plurality of time steps when the task was performed by the expert. It further contains data defining the positions and optionally velocities of all members of the agent at these time steps (this may be in the form of proprioceptive data). The datasets may optionally also contain corresponding image data for each of the time steps. However, more preferably such image data is generated from the physical state data by the initial state definition unit 221 later in the method (in step 302, as explained below). The datasets are stored in the database 220.

Optionally, the task may be divided into a number of task-stages. For example, if the task is one of controlling the agent 102 to pick up blocks and stack them, we define three task-stages: reaching a block, lifting a block, and stacking blocks (i.e., placing a lifted block onto another block). Each of the expert trajectories may be partitioned into respective portions corresponding to the task-stages. This is advantageous since reinforcement learning tends to be less successful for tasks having a long duration.

Steps 302-311 describe an iterative process 312 for training the control system 206 within the system 200. The goal is to learn a visuomotor policy which takes as input both the image data 104 (e.g., an RGB camera observation) and the proprioceptive data 107 (e.g. a feature vector that describes the joint positions and angular velocities). The whole network is trained end-to-end.

In step 302, a starting state (“initial state”) of the environment and the agent 102 is defined by the initial state definition unit 221. Shaping the distribution of initial states towards states that the optimal policy tends to visit can greatly improve policy learning. For that reason, step 302 is preferably performed using, for each of the task-stages, a respective predefined set of demonstration states (“cluster” of states) which are physical state data for the environment and agent for states taken from the expert trajectories, such that the set of demonstration states for each task-stage is composed of states which are in the portion of the corresponding expert trajectory which is associated with that task-stage. In step 302, with probability f, the initial state is defined randomly. That is, a possible state of the environment and the agent is generated virtually, e.g., by a selection from a predefined probability distribution defined based on the task. Conversely, with probability 1−f a cluster is randomly selected, and the initial state is defined as a randomly chosen demonstration state from that cluster. This is possible since the simulated system is fully characterized by the physical state data.
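The following is a minimal sketch of this initial-state selection; the names `demo_clusters` and `sample_random_state`, and the default value of the probability f, are assumptions introduced for illustration only.

```python
import random

def select_initial_state(demo_clusters, sample_random_state, f=0.1):
    """Select an initial physical state s_t as in step 302.

    demo_clusters: list of lists; each inner list holds demonstration states
        (physical state data) taken from the portion of the expert
        trajectories associated with one task-stage.
    sample_random_state: callable returning a state drawn from a predefined
        task-dependent distribution.
    f: probability of choosing a random state rather than a demonstration state.
    """
    if random.random() < f:
        # With probability f, a possible state is generated virtually.
        return sample_random_state()
    # With probability 1 - f, pick a cluster (task-stage) at random, then a
    # demonstration state from that cluster.
    cluster = random.choice(demo_clusters)
    return random.choice(cluster)
```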

The initial state s_(t) defines the initial positions, orientations, configurations and optionally velocities in the simulation of the object(s) and the agent 102.

From the initial physical state data, the initial state definition unit 221 generates simulated control system state data, i.e., image data 204 which encodes an image which would be captured if the environment (and optionally the agent) in the initial state were imaged by a virtual camera in an assumed (e.g., randomly chosen) position and orientation. Alternatively, the initial state definition unit 221 may read this image data from the database 220 if it is present there. The initial state definition unit 221 transmits the image data 204 to the convolutional network 208.

The initial state definition unit 221 further generates or reads from the database 220 proprioceptive data 207 for the initial state, and transmits this to the perceptron 209 as part of the control system state data. Note that the image data 204 and proprioceptive data 207 may not be sufficient, even in combination, to uniquely determine the initial state.

The initial state definition unit 221 passes the physical state data s_(t) fully describing the initial state to the physical simulation unit 223 and to the update unit 231. The physical simulation unit 223 forwards the physical state data s_(t) to the discriminator network 222 and the value network 227. (In a variation of the embodiment, the initial state definition unit 221 transmits the initial physical state data s_(t) directly to the discriminator network 222 and/or the value network 227.) The value network 227 uses s_(t) to generate V_(ϕ)(s_(t)), and transmits it to the update unit 231.

In step 303, the control system 206 generates control data 205 specifying an action denoted “a_(t)”, and transmits it to the physical simulation unit 223 and to the discriminator network 222.

In step 304, the discriminator network 222 generates a discrimination value D_(ψ)(s_(t), a_(t)) which is transmitted to the update unit 231.

In step 305, the physical simulation unit 223 simulates the effect of the agent 102 carrying out the action a_(t) specified by the control data 205, and thereby generates updated physical state data s_(t+1). Based on the updated physical state data s_(t+1), the physical simulation unit 223 generates updated control system state data (updated image data 204 and updated proprioceptive data 207; the updated proprioceptive data may in some implementations be a portion of the updated physical state data), and transmits them respectively to the convolutional network 208 and to the perceptron 209.

The physical simulation unit 223 also transmits the updated physical state data s_(t+1) to the update unit 231, and to the discriminator network 222.

In step 306, the auxiliary task network 224 generates data P which is transmitted to the update unit 231.

In step 307 the update unit 231 uses s_(t+1) and a task reward function to calculate a reward value. Note that the order of steps 306 and 307 can be different.

In step 308, it is determined whether steps 303-307 have been performed at least K times since the last time at which step 302 was performed. Here K is a positive integer which is preferably greater than one. Note that the case of K=1 is equivalent to omitting step 308.

If the determination in step 308 is negative, the method returns to step 303 and the loop of steps is performed again, using s_(t+1) in place of s_(t), and generating s_(t+2) instead of s_(t+1). More generally, in the (i+1)-th time that the set of steps 304-308 is performed, s_(t+i) is used in place of s_(t), and s_(t+i+1) is generated instead of s_(t+1).
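The loop of steps 303-307 may be pictured as in the sketch below, in which `policy`, `render`, `proprio_of`, `simulate`, `discriminator` and `task_reward` are hypothetical stand-ins for the control system 206, the virtual camera, the proprioceptive-data extraction, the physical simulation unit 223, the discriminator network 222 and the task reward function respectively; the names and signatures are assumptions for illustration.

```python
def rollout(s_t, policy, render, proprio_of, simulate, discriminator,
            task_reward, K):
    """Generate K simulated steps starting from the initial physical state
    s_t (steps 303-307 of FIG. 3, repeated K times)."""
    states, actions, d_values, rewards = [s_t], [], [], []
    hidden = None
    s = s_t
    for _ in range(K):
        image, proprio = render(s), proprio_of(s)      # control system state data
        a, _, hidden = policy(image, proprio, hidden)  # step 303
        d_values.append(discriminator(s, a))           # step 304
        s = simulate(s, a)                             # step 305: s_{t+i+1}
        rewards.append(task_reward(s))                 # step 307
        states.append(s)
        actions.append(a)
    return states, actions, d_values, rewards
```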

If the determination in step 308 is positive, then in step 309 the value network 227 generates V_(ϕ)(s_(t+K)). This value is transmitted to the update unit 231. Note that in a variation, the value network 227 may generate the value V_(ϕ)(s_(t)) at this time too, instead of in step 302. In step 310, the update unit 231 adjusts the parameters of the control system 206 (that is, parameters of the convolutional network 208, and the neural networks 209, 210), based on the discrimination values D_(ψ)(s, a), the reward values obtained in step 307, and the values V_(ϕ)(s_(t)) and V_(ϕ)(s_(t+K)). This step is explained in more detail below. The update unit 231 also adjusts the parameters of the convolutional network 208 and the network 225 of the auxiliary task network 224 based on a comparison of the K outputs P with the K states s. It also adjusts the parameters of the networks 228, 229 of the value network 227. It also adjusts the parameters of the discriminator network 222.

In step 311, the controller 232 determines whether a termination criterion is met (e.g., that a certain amount of time has passed). If not, the method returns to step 302, for the selection of a new initial state.

If the determination in step 311 is positive, then in step 313 the trained control system 206 is used as the control system 106 of the system 100, i.e., to control a real agent 102 in the real world based on real-world images and proprioceptive data.

To explain the update carried out in step 310 we start with a brief review of the basics of two known techniques: generative adversarial imitation learning (GAIL) and proximal policy optimization (PPO).

As noted above, imitation learning (IL) is the problem of learning a behavior policy by mimicking human trajectories (expert trajectories). The human demonstrations may be provided as a dataset of N state-action pairs 𝒟={(s_(i), a_(i))}_(i=1, . . . N). Some imitation learning methods cast the problem as one of supervised learning, i.e., behavior cloning. These methods use maximum likelihood to train a parameterized policy π_(θ): 𝒮→𝒜, where 𝒮 is the state space (i.e., each possible state s is a point in 𝒮) and 𝒜 is the action space (i.e. each possible action a is a point in 𝒜), such that θ*=arg max_(θ)Σ_(N) log π_(θ)(a_(i)|s_(i)). The policy π_(θ) is typically a stochastic policy which, for a given input s, outputs a probability distribution over the action space 𝒜, i.e., specifying a respective probability value for any possible action a in the space; thus, the agent 102 is provided with control data a which is a sample from this probability distribution.
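For concreteness, a behavior-cloning objective of this kind could be sketched as follows, assuming the policy exposes, for each state, a distribution object with a log-probability method; this is an illustration of the maximum-likelihood objective above, not an implementation from the disclosure.

```python
def behavior_cloning_loss(policy_dist_fn, demo_states, demo_actions):
    """Negative log-likelihood of the expert actions under the policy,
    i.e. -sum_i log pi_theta(a_i | s_i)."""
    total = 0.0
    for s, a in zip(demo_states, demo_actions):
        dist = policy_dist_fn(s)          # distribution over actions for state s
        total = total - dist.log_prob(a).sum()
    return total
```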

The GAIL (“Generative adversarial imitation learning”, Jonathan Ho and Stefano Ermon, in NIPS, pages 4565-4573, 2016) technique allows an agent to interact with an environment during its training and learn from its experiences. Similar to Generative Adversarial Networks (GANs), GAIL employs two networks, a policy network π_(θ): 𝒮→𝒜 and a discriminator network D_(ψ): 𝒮×𝒜→[0,1].

GAIL uses a min-max objective function similar to that of GANs to train π_(θ) and D_(ψ) together:

$\begin{matrix}{\min\limits_{\theta}\mspace{14mu}{\max\limits_{\psi}\left( {\mathbb{E}_{\pi_{E}}\left\lbrack {\log\;{D_{\psi}\left( {s,a} \right)}} \right\rbrack + \mathbb{E}_{\pi_{\theta}}\left\lbrack {\log\left( {1 - {D_{\psi}\left( {s,a} \right)}} \right)} \right\rbrack} \right)}} & (1)\end{matrix}$

where π_(E) denotes the expert policy that generated the expert trajectories. This objective encourages the policy π_(θ) to have an occupancy measure close to that of the expert policy π_(E). The discriminator network D_(ψ) is trained to produce a high value for states s and actions a such that the action a has a high probability under the expert policy π_(E)(s) and a low probability under the policy π_(θ)(s), and a low value in the converse case.
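A minimal sketch of the inner (discriminator) maximization of Eqn. (1) is shown below, written as the equivalent binary cross-entropy on expert and policy state-action pairs; the discriminator module, its optimizer and the feature tensors are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_update(disc, disc_opt, expert_sa, policy_sa):
    """One gradient step on the inner maximization of Eqn. (1).

    expert_sa, policy_sa: tensors of concatenated (s, a) features for
    expert-trajectory pairs and for pairs generated by the current policy.
    """
    expert_logits = disc(expert_sa)       # D_psi should be high (near 1) here
    policy_logits = disc(policy_sa)       # and low (near 0) here
    loss = F.binary_cross_entropy_with_logits(
        expert_logits, torch.ones_like(expert_logits)) + \
        F.binary_cross_entropy_with_logits(
        policy_logits, torch.zeros_like(policy_logits))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()
```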

IL approaches work effectively if demonstrations are abundant, but robot demonstrations can be costly and time-consuming to collect, so the method 300 uses a process which does not require so many of them.

In continuous domains, trust region methods greatly stabilize policy training. Recently, proximal policy optimization (PPO) has been proposed (John Schulman, et al, “Proximal policy optimization algorithms”, arXiv preprint, arXiv:1707.06347, 2017) as a simple and scalable technique for policy optimization. PPO only relies on first-order gradients and can be easily implemented with recurrent networks in a distributed setting. PPO implements an approximate trust region that limits the change in the policy per iteration. This is achieved via a regularization term based on the Kullback-Leibler (KL) divergence, the strength of which is adjusted dynamically depending on the actual change in the policy in past iterations.
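As a rough sketch only, a KL-regularized surrogate objective with an adaptively adjusted coefficient can be written as follows; the adaptation thresholds and factors are illustrative assumptions, not values taken from the PPO paper or from this disclosure.

```python
import torch

def ppo_kl_loss(new_log_prob, old_log_prob, advantage, kl, beta):
    """Surrogate policy objective with a KL regularization term whose
    strength beta is adapted depending on the observed change in the policy."""
    ratio = torch.exp(new_log_prob - old_log_prob)
    return -(ratio * advantage).mean() + beta * kl.mean()

def adapt_beta(beta, observed_kl, target_kl):
    """Increase beta if the policy changed too much, decrease it otherwise."""
    if observed_kl > 2.0 * target_kl:
        return beta * 1.5
    if observed_kl < target_kl / 2.0:
        return beta / 1.5
    return beta
```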

Returning to the discussion of the method 300, the step 310 adjusts π_(θ) with policy gradient methods to maximize a hybrid reward r(s_(t), a_(t)) given by a hybrid reward function:

$\begin{matrix}{r(s_{t},a_{t}) = \lambda\, r_{gail}(s_{t},a_{t}) + (1 - \lambda)\, r_{task}(s_{t},a_{t}),\mspace{14mu}\lambda \in \lbrack 0,1\rbrack} & (2)\end{matrix}$

This reward function is a hybrid in the sense that the first term on the right represents imitation learning, using the discriminator 222, and the second term on the right represents reinforcement learning.

Here, r_(gail) is a first reward value (an “imitation reward value”) which is a discounted sum of a reward function, defined by an imitation reward function r_(gail)(s_(t), a_(t))=−log(1−D_(ψ)(s, a)), clipped at a max value of 10.

r_(task)(s_(t), a_(t)) is a second reward value (a “task reward value”) defined by the task reward function, which may be designed by hand for the task. For example, when doing block lifting, r_(task)(s_(t), a_(t)) may be 0.125 if the hand is close to the block and 1 if the block is lifted. The task reward permits reinforcement learning. Optionally, the task reward function may be designed to be different respective functions for each task-stage. For example, the task reward function may be respective piecewise-constant functions for each of the different task-stages of the task. Thus, the task reward value only changes when the task transits from one stage to another. We have found defining such a sparse multi-stage reward easier than handcrafting a dense shaping reward and less prone to producing suboptimal behaviors.
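The hybrid reward of Eqn. (2), with the clipped imitation reward and a piecewise-constant stage-based task reward of the kind just described, might be sketched as follows. The block-lifting reward values 0.125 and 1 are taken from the text; the stage names, the reward for the initial stage, the function names and the value of λ are assumptions for illustration.

```python
import math

def imitation_reward(d_value, max_value=10.0):
    """r_gail(s_t, a_t) = -log(1 - D_psi(s_t, a_t)), clipped at a maximum of 10."""
    return min(-math.log(max(1.0 - d_value, 1e-8)), max_value)

def block_lift_task_reward(stage):
    """Illustrative piecewise-constant, stage-based task reward for block lifting."""
    return {"far": 0.0, "reach": 0.125, "lift": 1.0}[stage]

def hybrid_reward(d_value, stage, lam=0.5):
    """r = lambda * r_gail + (1 - lambda) * r_task, as in Eqn. (2)."""
    return lam * imitation_reward(d_value) + (1.0 - lam) * block_lift_task_reward(stage)
```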

Maximizing the hybrid reward can be interpreted as simultaneous reinforcement and imitation learning, where the imitation reward encourages the policy to generate trajectories closer to the expert trajectories, and the task reward encourages the policy to achieve high returns on the task. Setting λ to either 0 or 1 reduces this method to the standard RL or GAIL setups. Experimentally, we found that choosing λ to be between 0 and 1, to give a balanced contribution of the two rewards, produced a trained control system 206 which could control agents to solve tasks that neither GAIL nor RL can solve alone. Further, the controlled agents achieved higher returns than the human demonstrations owing to the exposure to task rewards.

Note that although in the training system 200 of FIG. 2, the control system 206 being trained receives only the image data 204 and the proprioceptive data 207 (like the control system 106 in the system 100 of FIG. 1), the discriminator network 222 and the value network 227 receive input data specifying the positions and/or orientations and/or velocities of objects of the simulated environment and optionally also of members of the simulated agent. This input data is (or is generated from) the physical state data generated by the initial state definition unit 221 and the physical simulation unit 223. Even though such privileged information is unavailable in the system 100, the system 200 can take advantage of it when training the control system 206 in simulation.

The optional value network 227 is used by the update unit 231 to obtain values V_(ϕ)(s_(t)) and V_(ϕ)(s_(t+K)) which are used together with the reward value calculated in step 307 to obtain an advantage function estimator. The advantage function estimator is a parameter used to obtain a gradient estimator for r(s_(t), a_(t)). The gradient estimator may, for example, take the form:

gradient $= \mathbb{E}_{t}\left\lbrack {\nabla_{\theta}{\log{\pi_{\theta}(a_{t} \mid s_{t})}}\,{\hat{A}}_{t}} \right\rbrack$

In step 310, the update unit 231 uses the discounted sum of rewards and the value as an advantage function estimator:

$\begin{matrix}{{\hat{A}}_{t} = {\sum_{i = 1}^{K}{\gamma^{i - 1}r_{t + i}}} + \gamma^{K - 1}V_{\phi}(s_{t + K}) - V_{\phi}(s_{t}),} & (3)\end{matrix}$

where γ is a discount factor and r_(t+i)≡r(s_(t+i−1), a_(t+i−1)).
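A direct transcription of Eqn. (3) is sketched below; the list-based interface is an assumption for illustration.

```python
def advantage_estimate(rewards, v_t, v_tK, gamma):
    """Eqn. (3): A_hat_t = sum_{i=1..K} gamma^(i-1) r_{t+i}
    + gamma^(K-1) V_phi(s_{t+K}) - V_phi(s_t).

    rewards: [r_{t+1}, ..., r_{t+K}] with r_{t+i} = r(s_{t+i-1}, a_{t+i-1}).
    v_t, v_tK: value-network outputs V_phi(s_t) and V_phi(s_{t+K}).
    """
    K = len(rewards)
    discounted = sum(gamma ** (i - 1) * r for i, r in enumerate(rewards, start=1))
    return discounted + gamma ** (K - 1) * v_tK - v_t
```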

The value network 227 is trained in the same way as in the PPO paper mentioned above. Since the policy gradient uses the output of the value network to reduce variance, it is beneficial to accelerate the training of the value network 227. Rather than using pixels as inputs to the value network 227 (as for the control system (policy network) 206), as noted above the value network 227 preferably receives the physical state data output by the initial state definition unit 221 and the physical simulation unit 223 (e.g., specifying the position and velocity of the simulated 3D object(s) and the members of the agent 102). Experimentally it has been found that training the control system 206 and value network 227 based on different input data stabilizes training and reduces oscillation of the agent's performance. Furthermore, the multilayer perceptron 228 of the value network 227 is preferably smaller (e.g., includes fewer neurons and/or layers) than the perceptron 209 of the control system 206, so it converges to a trained state after fewer iterations.

As noted above, like the value network 227, the discriminator network 222 preferably receives input data specifying the positions and/or orientations and/or velocities of objects of the simulated environment and optionally also of members of the simulated agent. Furthermore, optionally, this data may be encoded in a task-specific way, i.e., as task specific features chosen based on the task. The input data to the discriminator network 222 is formatted to indicate (directly, not implicitly) absolute or relative positions of objects in the simulated environment which are associated with the task. We refer to this as an “object centric” representation, and it leads to improved training of the control system 206. Thus, this object-centric input data provides salient and relevant signals to the discriminator network 222. By contrast, the input data to the discriminator network 222 preferably does not include data indicating the positions/velocities of members of the simulated agent. Experimentally, it has been found that including such information in the input to the discriminator network 222 leads the discriminator network 222 to focus on irrelevant aspects of the behavior of the agent and is detrimental for training of the control system 206. Thus, preferably the input data to the discriminator network 222 only includes the object-centric features as input while masking out agent-related information.

The construction of the object-centric representation is based on a certain amount of domain knowledge of the task. For example, in the case of a task in which the agent comprises a gripper, the relative positions of objects and displacements from the gripper to the objects usually provide the most informative characterization of a task. Empirically, it was found that the systems 100, 200 are not very sensitive to the particular choices of object-centric features, as long as they carry sufficient task-specific information.
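One hedged illustration of an object-centric feature vector for a gripper task is sketched below; the state dictionary keys are assumptions, and the particular choice of features is only one possibility consistent with the description above.

```python
def object_centric_features(state):
    """Discriminator input: absolute object positions, relative displacements
    between objects, and gripper-to-object displacements; agent joint angles
    and velocities are deliberately masked out."""
    feats = []
    for obj in state["object_positions"]:              # absolute positions
        feats.extend(obj)
    for i, a in enumerate(state["object_positions"]):  # object-to-object offsets
        for b in state["object_positions"][i + 1:]:
            feats.extend(x - y for x, y in zip(a, b))
    for obj in state["object_positions"]:              # gripper-to-object offsets
        feats.extend(x - y for x, y in zip(state["gripper_position"], obj))
    return feats
```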

The function of the optional auxiliary task neural network 224 is to train the convolutional neural network 208, while simultaneously training the multilayer perceptron 225. This training is effective in increasing the speed at which the convolutional neural network 208 is trained, while also improving the final performance. Note that the convolutional neural network 208 is preferably not trained exclusively via the auxiliary tasks, but also in the updates to the control system 206 in order to learn the policy.

The auxiliary task neural network 224 may, for example, be trained to predict the locations of objects in the simulated environment from the camera observation. The output layer 226 may be implemented as a fully-connected layer to output the 3D coordinates of objects in the task, and the training of the neural network 225 (and optionally output layer 226) together with the training of the convolutional network 208 is performed to minimize the L2 loss between the predicted and ground-truth object locations. In step 310 all the neural networks 208, 225, 209, 210, 228, 229 of system 200 are preferably updated, so that in the method 300 as a whole they are all trained simultaneously.
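A minimal sketch of such an auxiliary head and its L2 loss is given below; the feature dimension, hidden size and number of objects are assumptions, and only the shared convolutional features are taken as given.

```python
import torch
import torch.nn as nn

class AuxiliaryPositionHead(nn.Module):
    """Sketch of the auxiliary task network 224: predicts 3D object coordinates
    from the output of the shared convolutional network 208. Sizes assumed."""

    def __init__(self, conv_feat_dim=2592, num_objects=2, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(                      # input network 225
            nn.Linear(conv_feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3 * num_objects),    # output layer 226
        )

    def forward(self, conv_features):
        return self.net(conv_features)

def auxiliary_loss(pred_positions, true_positions):
    """L2 loss between predicted and ground-truth object locations."""
    return ((pred_positions - true_positions) ** 2).sum(dim=-1).mean()
```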

The updates to the discriminator network 222 may be performed in the same way as in the GAIL paper mentioned above.

Although the explanation of FIG. 3 above is in terms of a method performed by the system 200 of FIG. 2, more preferably the method is performed in parallel by a plurality of workers under the control of the controller 232. For each worker the training is performed according to steps 302-310, using Eqn. (2) on a respective instance of the control system 206. The convolutional network 208 may be shared between all workers.

A single discriminator network 222 may be provided, shared between all the workers. The discriminator network 222 may be produced by maximization of an expression which is a variant of Eqn. (1) in which there is no minimization over θ, and in which the final term is an average over all the workers, rather than being based on a single policy network.

Exemplary resultant networks are found to improve on imitation learning and reinforcement learning in a variety of tasks.

This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently.

Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.
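By way of non-limiting illustration only, a control network of the general kind described in this specification might be sketched using the TensorFlow framework roughly as follows; the input shapes, layer sizes, optimizer, and output dimensionality used here are hypothetical placeholders rather than values taken from this specification.

    # Illustrative sketch only: a small policy network built with the TensorFlow
    # Keras API (one of the frameworks named above). All sizes are assumptions.
    import tensorflow as tf

    image_in = tf.keras.Input(shape=(64, 64, 3), name="image")        # camera observation (assumed size)
    proprio_in = tf.keras.Input(shape=(8,), name="proprioception")    # e.g., joint angles/velocities (assumed size)

    x = tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu")(image_in)
    x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Concatenate()([x, proprio_in])
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    commands = tf.keras.layers.Dense(6, name="joint_velocities")(x)   # one command per joint (assumed)

    model = tf.keras.Model(inputs=[image_in, proprio_in], outputs=commands)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")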

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A computer-implemented method of training a neural network to generate commands for controlling an agent to perform a task in an environment, the method comprising: obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task; and using the datasets, training a neural network to generate commands for controlling the agent based on image data encoding captured images of the environment and proprioceptive data comprising one or more variables describing configurations of the agent; wherein the training of the neural network comprises: using the neural network to generate a plurality of sets of one or more commands, for each set of commands generating at least one corresponding reward value indicative of how successfully the task is carried out upon implementation of the set of commands by the agent, and adjusting one or more parameters of the neural network based on the datasets, the sets of commands and the corresponding reward values.
 2. The method of claim 1, in which adjusting the one or more parameters of the neural network comprises adjusting the neural network based on a hybrid energy function, the hybrid energy function including both an imitation reward value derived using the datasets and the generated sets of commands, and a task reward term calculated using the generated reward values.
 3. The method of claim 2, including using the datasets to generate a discriminator network, and deriving the imitation reward value using the discriminator network and the sets of one or more commands.
 4. The method of claim 3, in which the discriminator network receives data characterizing the positions of objects in the environment.
 5. The method of claim 1, in which the reward value is generated by computationally simulating a process carried out by the agent in the environment based on the corresponding set of commands to generate a final state of the environment, and calculating an initial reward value based at least on the final state of the environment.
 6. The method of claim 5, in which updates to the neural network are calculated using an advantage function estimator obtained by subtracting a value function from the initial reward value, and the initial reward value is calculated according to a task reward function based on the final state of the environment.
 7. The method of claim 6, in which the value function is calculated using data characterizing the positions of objects in the environment.
 8. The method of claim 6, in which the value function is calculated by an adaptive model.
 9. The method of claim 1, in which the neural network comprises a convolutional neural network which receives the image data and from it generates convolved data, the neural network further comprising at least one adaptive component which receives the output of the convolutional neural network and the proprioceptive data.
 10. The method according to claim 9, in which the adaptive component is a perceptron.
 11. The method of claim 9, in which the neural network further comprises a recurrent neural network, which receives input data generated both from the image data and the proprioceptive data.
 12. The method of claim 9, further including defining at least one auxiliary task, and training the convolutional neural network as part of an adaptive system which is trained to perform the auxiliary task based on image data.
 13. The method of claim 1, in which the training of the neural network is performed in parallel with the training of a plurality of additional instances of the neural network by respective workers, the adjustment of the parameters of the neural network being additionally based on reward values indicative of how successfully the task is carried out by simulated agents based on sets of commands generated by the additional instances of the neural network.
 14. The method of claim 1, in which the step of using the neural network to generate a plurality of sets of commands is performed at least once by supplying to the neural network image data and proprioceptive data which characterizes a state associated with one of the performances of the task.
 15. The method of claim 1, further comprising, prior to training the neural network, defining a plurality of stages of the task, and for each stage of the task defining a respective plurality of initial states, the step of using the neural network to generate a plurality of sets of commands being performed at least once, for each task stage, by supplying to the neural network image data and proprioceptive data which characterizes one of the corresponding plurality of initial states.
 16. A method of performing a task, the method comprising: training a neural network to generate commands for controlling an agent to perform the task in an environment, by a method according to any preceding claim; and a plurality of times performing the steps of: (i) capturing images of an environment and generating image data encoding the images; (ii) capturing proprioceptive data comprising one or more variables describing configurations of the agent; (iii) transmitting the image data and the proprioceptive data to the neural network, the neural network generating at least one command based on the image data and the proprioceptive data; and (iv) transmitting the command to the agent, the agent being operative to perform the command within the environment; whereby the neural network successively generates a sequence of commands to control the agent to perform the task.
 17. The method of claim 16, in which the step of obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task, is performed by controlling the agent to perform the task a plurality of times, and for each performance generating a respective dataset characterizing the performance.
 18. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising: obtaining, for each of a plurality of performances of the task, a respective dataset characterizing the corresponding performance of the task; and using the datasets, training a neural network to generate commands for controlling the agent based on image data encoding captured images of the environment and proprioceptive data comprising one or more variables describing configurations of the agent; wherein the training of the neural network comprises: using the neural network to generate a plurality of sets of one or more commands, for each set of commands generating at least one corresponding reward value indicative of how successfully the task is carried out upon implementation of the set of commands by the agent, and adjusting one or more parameters of the neural network based on the datasets, the sets of commands and the corresponding reward values.
 19. The system of claim 18, further including: an agent operative to perform commands generated by the neural network; at least one image capture device operative to capture images of an environment and generate image data encoding the images; and at least one device operative to capture proprioceptive data comprising the one or more variables describing configurations of the agent.
 20. (canceled)
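By way of non-limiting illustration of the hybrid objective recited in claims 1 to 3 and the advantage estimator of claim 6, the following sketch combines a discriminator-based imitation reward with a task reward. The particular imitation-reward form, the mixing weight, and all function and parameter names are hypothetical placeholders and are not part of the claims.

    # Illustrative sketch only: a hybrid reward mixing an imitation reward
    # (derived from a discriminator network trained on the demonstration
    # datasets) with a task reward computed from a simulated rollout.
    # The GAIL-style form -log(1 - D(s, a)) and the weight `lam` are assumptions.
    import numpy as np

    def hybrid_reward(discriminator_score: float, task_reward: float, lam: float = 0.5) -> float:
        """Combine an imitation reward and a task reward into a single scalar."""
        d = np.clip(discriminator_score, 1e-6, 1.0 - 1e-6)   # probability the pair is expert-like
        imitation_reward = -np.log(1.0 - d)
        return lam * imitation_reward + (1.0 - lam) * task_reward

    def advantage(initial_reward: float, value_estimate: float) -> float:
        """Advantage-style estimator as in claim 6: reward minus a learned value function."""
        return initial_reward - value_estimate

    # Hypothetical usage with made-up numbers:
    r = hybrid_reward(discriminator_score=0.8, task_reward=1.0)
    a = advantage(initial_reward=r, value_estimate=0.3)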